This report explores a dataset containing a quality rating and 11 chemical features for 1,599 red wine observations.
## [1] 1599 12
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
Counting the NA values in each column of the data frame shows that none are missing.
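A missing-value count like the one above can be produced in one line of R. The following is a minimal sketch, not necessarily the code behind the report: `wine_head` is a hypothetical name, and it holds only the first ten rows of five columns, copied from the `str()` output above.

```r
# First ten rows of five columns, copied from the str() output above.
# `wine_head` is a hypothetical name; the full data has 1,599 rows.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(wine_head))
print(na_counts)

# For contrast, a copy with one value blanked out reports a single NA.
with_gap <- wine_head
with_gap$alcohol[3] <- NA
gap_counts <- colSums(is.na(with_gap))
print(gap_counts)
```

A column of all zeros, as in the report, means no imputation or row filtering is needed before computing correlations.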
## [,1]
## fixed.acidity 0.12405165
## volatile.acidity -0.39055778
## citric.acid 0.22637251
## residual.sugar 0.01373164
## chlorides -0.12890656
## free.sulfur.dioxide -0.05065606
## total.sulfur.dioxide -0.18510029
## density -0.17491923
## pH -0.05773139
## sulphates 0.25139708
## alcohol 0.47616632
Correlation of every variable with quality. Four attributes have a weak to moderate correlation (negative or positive) with quality: volatile.acidity, citric.acid, sulphates, and alcohol.
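A one-column table of this shape is what `cor()` returns when it is given a data frame of features and a single vector. The sketch below assumes that approach (the report does not show its code) and uses a hypothetical ten-row sample, `wine_head`, copied from the `str()` output, so the numbers will not match the full-data values above.

```r
# Ten-row sample copied from the str() output; `wine_head` is a hypothetical name.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Every column except the response.
features <- setdiff(names(wine_head), "quality")

# Pearson correlation (the default) of each feature with quality.
# The result is a matrix with one row per feature and a single column.
cor_with_quality <- cor(wine_head[, features], wine_head$quality)
print(cor_with_quality)

# Rank features by the strength of the relationship, ignoring sign.
ranked <- cor_with_quality[order(-abs(cor_with_quality[, 1])), , drop = FALSE]
print(ranked)
```

Sorting by absolute value keeps strongly negative features such as volatile acidity near the top rather than at the bottom of the list.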
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
Correlations between all variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The quality scores are roughly normally distributed. Although the scale runs from 0 to 10, no wine scored below 3 or above 8, and most scored 5 or 6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## [1] 0.6703331
Volatile acidity is positively skewed.
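The skewness figure above is a single number where positive values indicate a longer right tail. The report does not say which function produced it, so the sketch below is an assumption: a hand-written moment-based skewness (third central moment divided by the 1.5 power of the second). Package functions may apply a different bias correction and give slightly different values.

```r
# Moment-based sample skewness: m3 / m2^(3/2).
# `moment_skewness` is a hypothetical helper, not the report's function.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# A symmetric sample has zero skewness.
symmetric_skew <- moment_skewness(c(1, 2, 3, 4, 5))

# One large value on the right produces positive skewness.
right_tail_skew <- moment_skewness(c(1, 1, 1, 10))

# First ten volatile acidity values from the str() output above.
volatile_head <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
print(moment_skewness(volatile_head))
```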
## [1] 1599 12
## [1] 1580 13
## [1] 19
## [1] -0.3634752
Removal of outliers did not improve the correlation with quality.
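The report does not state which rule was used to flag outliers. A common choice is Tukey's 1.5 × IQR fences, and the sketch below assumes that rule; `remove_iqr_outliers` and the toy data frame are hypothetical and only illustrate the before-and-after comparison.

```r
# Drop rows whose value in `column` lies outside the Tukey fences
# [Q1 - k * IQR, Q3 + k * IQR]. The choice of k = 1.5 is an assumption.
remove_iqr_outliers <- function(df, column, k = 1.5) {
  x <- df[[column]]
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  keep <- !is.na(x) & x >= q[1] - fence & x <= q[2] + fence
  df[keep, , drop = FALSE]
}

# Toy data: five typical volatile acidity values plus one extreme value.
toy <- data.frame(
  volatile.acidity = c(0.4, 0.5, 0.52, 0.6, 0.64, 1.58),
  quality          = c(6, 6, 5, 5, 5, 3)
)

trimmed <- remove_iqr_outliers(toy, "volatile.acidity")

# Rows before and after, and how many were dropped.
print(c(before = nrow(toy), after = nrow(trimmed),
        removed = nrow(toy) - nrow(trimmed)))

# Correlation with quality before and after trimming.
print(cor(toy$volatile.acidity, toy$quality))
print(cor(trimmed$volatile.acidity, trimmed$quality))
```

Comparing the two correlations side by side is the same check the report applies: if the trimmed value is no stronger, the outliers were not masking the relationship.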
The fourth root (square root of the square root) appears to give the most nearly normal distribution.
## [1] -0.3934108
But it doesn’t create a much stronger correlation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## [1] 0.3177403
Citric acid appears to be positively skewed, but the jump around 0.5 reduces the skewness measure.
## [1] 0.2324389
The removal of outliers did not improve the correlation, which makes sense since the chart doesn’t appear to show any outliers.
Of all the transforms, the square root creates the most nearly normal distribution, but a large number of wines still have almost no citric acid.
## [1] 0.2066822
And the square root actually lowers the correlation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
## [1] 2.424118
Sulphates are highly positively skewed.
## [1] 0.3940654
Removing sulphate outliers significantly improves the correlation with quality, from 0.25 to 0.39.
The reciprocal transform gives the most nearly normal distribution.
## [1] -0.3403317
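This increased the strength of the correlation relative to the untransformed sulphates (from 0.25 to 0.34 in magnitude) and turned it negative, as expected for a reciprocal. We’ll explore both options in the bivariate section.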
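The sign change is a property of the reciprocal itself: for positive values, 1/x reverses the ordering, so large sulphate values become small ones. With a rank-based (Spearman) correlation the coefficient is negated exactly. With Pearson, which `cor()` uses by default, the sign typically flips as well but the magnitude can change, because the reciprocal also reshapes the distribution. The sketch below illustrates this on the first ten rows from the `str()` output; such a small sample will not reproduce the full-data values.

```r
# First ten sulphates and quality values from the str() output above.
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
quality   <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Rank correlation: the reciprocal reverses the ranks, so rho is negated.
rho_raw   <- cor(sulphates, quality, method = "spearman")
rho_recip <- cor(1 / sulphates, quality, method = "spearman")
print(c(raw = rho_raw, reciprocal = rho_recip))

# Pearson correlation: the magnitude is free to change under the transform.
r_raw   <- cor(sulphates, quality)
r_recip <- cor(1 / sulphates, quality)
print(c(raw = r_raw, reciprocal = r_recip))
```

When comparing a reciprocal transform against the original feature, it is therefore the absolute values of the correlations that should be compared.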
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## [1] 0.8592144
Alcohol has a positive skew.
## [1] 0.4710238
Removing alcohol outliers did not improve correlation with quality.
The best normalization is created by the square root transform.
## [1] 0.4768205
No real change in correlation, but we’ll plot both in the bivariate section.
There are 1,599 red wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality).
The variables volatile acidity, citric acid, sulphates, and alcohol have the highest correlations with quality.
For the four features with the highest correlations, I removed outliers and recomputed the correlations with quality. This produced a significant increase only for sulphates.
I also applied log, square root, fourth root (square root of the square root), cube root, and reciprocal transforms to the four features with the highest correlations, to see whether a more normal distribution could be obtained and whether that raised the correlation with quality. Apart from the reciprocal of sulphates, the transforms did not produce meaningfully higher correlations.
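A comparison like this can be organised as a small table with one row per transform. The sketch below is an illustration under stated assumptions: it uses only the first ten alcohol and quality values from the `str()` output, a hand-written moment-based skewness (`moment_skewness` is a hypothetical helper), and Pearson correlation. Note that `log` and the reciprocal are undefined or infinite at zero, which matters for citric acid.

```r
# First ten alcohol and quality values from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# The transforms named in the text, plus the untransformed feature.
transforms <- list(
  none        = function(x) x,
  log         = function(x) log(x),
  sqrt        = function(x) sqrt(x),
  fourth_root = function(x) sqrt(sqrt(x)),
  cube_root   = function(x) x^(1 / 3),
  reciprocal  = function(x) 1 / x
)

# Moment-based skewness: m3 / m2^(3/2); values near 0 indicate symmetry.
moment_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# One row per transform: skewness of the result and its correlation with quality.
comparison <- t(sapply(transforms, function(f) {
  y <- f(alcohol)
  c(skewness = moment_skewness(y), cor_quality = cor(y, quality))
}))
print(round(comparison, 3))
```

Reading down the `skewness` column shows which transform is closest to symmetric, and the `cor_quality` column shows whether that gain carries over to the relationship with quality.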
The negative correlation is obvious in the trendline between volatile acidity and quality.
The weak positive correlation can be seen in the trendline between citric acid and quality.
The reciprocal sulphates show a much stronger negative trend with quality.
Both alcohol and its square root transform appear to have the strongest trendlines we’ve seen with quality.
Now we’ll see how the four attributes with the highest correlations with quality correlate with each other.
## [1] -0.5524957
Pretty strong negative correlation between volatile acidity and citric acid, but both of those attributes could be correlated merely because they are acids in the wine.
## [1] -0.2609867
Weak correlation between volatile acidity and sulphates.
## [1] -0.202288
Weak correlation between volatile acidity and alcohol.
## [1] 0.31277
Medium correlation between citric acid and sulphates.
## [1] 0.1099032
Very little correlation between citric acid and alcohol.
## [1] 0.09359475
Very little correlation between sulphates and alcohol.
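The six pairwise values above can also be read from a single correlation matrix restricted to the four features. This is a minimal sketch using a hypothetical ten-row sample (`wine_head`, copied from the `str()` output), so the numbers differ from the full-data results.

```r
# Ten-row sample of the four features; `wine_head` is a hypothetical name.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# cor() on a data frame returns the full symmetric matrix of pairwise
# Pearson correlations, with 1s on the diagonal.
pairwise <- cor(wine_head)
print(round(pairwise, 2))

# An individual pair is looked up by row and column name.
va_vs_citric <- pairwise["volatile.acidity", "citric.acid"]
print(va_vs_citric)
```

Pairs with values close to zero, such as sulphates and alcohol in the report, carry largely independent information about the wines.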
Among the positive correlations, alcohol has the highest correlation with quality, followed by sulphates. Volatile acidity, citric acid, and sulphates have weak to medium correlations with each other.
The most interesting finding is that, although positively correlated with each other, both citric acid and sulphates have very low correlations with alcohol. This suggests that alcohol combined with either citric acid or sulphates would be a useful set of attributes for prediction.
Showing the qualities bucketed by alcohol, it’s obvious that alcohol has the greater impact, since no quality 8 wine appears below the 10.5-12 bucket. The line charts do still show the negative correlation with volatile acidity, as the quality lines gradually move lower on the graph.
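Bucketing a continuous variable like alcohol is typically done with `cut()`. In the sketch below only the 10.5-12 and 12-16 buckets come from the text; the lower break points (8 and 9) are assumptions, and the input values are the five-number summary of alcohol reported earlier.

```r
# Min, quartiles, and max of alcohol from the summary output above.
alcohol_values <- c(8.4, 9.5, 10.2, 11.1, 14.9)

# Break points: 10.5, 12, and 16 match buckets named in the text;
# 8 and 9 are assumed. Intervals are right-closed, e.g. (10.5, 12].
alcohol_bucket <- cut(
  alcohol_values,
  breaks = c(8, 9, 10.5, 12, 16),
  labels = c("8-9", "9-10.5", "10.5-12", "12-16")
)

# One factor level per input value.
print(data.frame(alcohol = alcohol_values, bucket = alcohol_bucket))

# Count of values per bucket.
bucket_counts <- table(alcohol_bucket)
print(bucket_counts)
```

The resulting factor can then serve as a grouping or faceting variable, so that each chart panel compares quality levels within a similar alcohol range.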
The positive correlations of both features are evident in both charts, but so are the relatively low correlation values. Citric acid has a large spread in the 12-16 alcohol bucket. Citric acid may matter more in determining quality at lower levels of alcohol content.
Sulphates show an obvious positive correlation with quality in both charts and appear to be less dependent on the alcohol quantities than citric acid.
The box plots revealed a lot of nuances in the data.
Volatile acidity’s negative correlation appears to come mostly from extremely high amounts. This is shown in the 10.5-12 alcohol bucket, where quality level 3 shows a large spike in the amount of volatile acidity.
Citric acid appears to have more of an impact on quality while in the lower alcohol buckets. Once the alcohol hits the highest bucket, the citric acid ranges for quality levels 6, 7, and 8 have large spreads and overlap each other.
Sulphates appear to have fairly tight ranges for all quality levels in each of the alcohol buckets. In addition, the sulphate levels at each quality level appear to be less affected by the alcohol buckets.
Alcohol had the highest positive correlation with wine quality. This makes sense, as one of the primary reasons to have an alcoholic beverage in the first place is the alcohol. At a quality level of around 7, the vast majority of wines contain more than 10% alcohol.
Sulphates had the second highest positive correlation with quality. Sulphates are additives that act as antimicrobial and antioxidant agents in wine. They preserve the wine, so a higher sulphate level may make it less likely that a tasted wine has gone bad.
This shows little correlation between sulphates and alcohol, the two attributes most positively correlated with wine quality in the dataset. Since we’re ultimately trying to find the attributes which influence the quality of wine, and possibly to predict quality from them, it’s important that the features are not redundant. Redundant attributes add little new information and can contribute to overfitting.
Because wine quality was effectively a categorical value, with most wines scoring 5 or 6, those two values strongly influenced the appearance of the graphs against quality. I had hoped that some of the transforms would give a higher correlation with quality than the untransformed attribute, but I didn’t see any real evidence of this with the transforms I tried.
Some limitations are due to the volume of data. 1,599 records is not a large dataset; perhaps I should have chosen the white wines instead. To investigate further, I would like to see a larger set. In addition, while the quality measure is the median of at least three wine experts’ ratings, I would also like to see the mean, which would give a more continuous quality measurement.
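The difference between the two summaries is easy to see with a small worked example. The ratings below are hypothetical (the individual expert scores are not part of the dataset): the median of three whole-number ratings is always one of those whole numbers, while the mean can fall between them.

```r
# Three hypothetical expert ratings for two wines.
wine_a <- c(5, 6, 6)
wine_b <- c(6, 6, 7)

# The median gives both wines the same score.
median_scores <- c(a = median(wine_a), b = median(wine_b))
print(median_scores)

# The mean separates them, on a finer scale of thirds.
mean_scores <- c(a = mean(wine_a), b = mean(wine_b))
print(round(mean_scores, 2))
```

A finer-grained response of this kind would make correlation and regression less sensitive to the heavy concentration of scores at 5 and 6.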